Use a private task-local RNG for kernel launch seeds#3161

Merged
maleadt merged 1 commit into JuliaGPU:main from JohnCobbler:launch-seed-rng on Jun 5, 2026

JohnCobbler (Contributor) commented Jun 4, 2026

Fixes #2417.

Problem

make_seed(::HostKernel) draws from Julia's default RNG on every kernel launch, so launching a kernel advances the user-visible rand() stream:

Random.seed!(42); a = rand()
Random.seed!(42); @cuda kernel(); b = rand()
a != b  # surprising
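
The effect can be reproduced without a GPU. The sketch below assumes a hypothetical `fake_launch_seed` that stands in for the old `make_seed(::HostKernel)`; it is not CUDA.jl code, it only draws one `UInt32` from the default RNG the way a launch used to:

```julia
using Random

# Hypothetical stand-in for the old behavior: the launch seed comes
# from the task's default RNG, so it consumes one value of that stream.
fake_launch_seed() = rand(UInt32)

Random.seed!(42); a = rand()
Random.seed!(42); fake_launch_seed(); b = rand()

# `b` is taken one step later in the stream than `a`.
println(a == b)
```

Any extra draw between `Random.seed!` and `rand()` shifts the stream in the same way, which is why a launch was visible to the user.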

Fix

Following the direction suggested in the issue ("probably better to maintain a local RNG in CUDA.jl for launching kernels"), make_seed now draws from a private task-local Xoshiro that is lazily created in task_local_storage():

launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :CUDACore_launch_rng)::Random.Xoshiro

make_seed(::HostKernel) = rand(launch_rng(), UInt32)

Xoshiro() seeds itself from system entropy without touching the default RNG, so the launch RNG is fully decoupled from the default RNG in both directions:

  • launches no longer perturb the user's rand() stream, and
  • Random.seed!(...) no longer influences which seeds reach kernels (device-side reproducibility remains available via in-kernel Random.seed!, which is unchanged).
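
Both directions can be checked on the host alone. This is a minimal sketch that copies the pattern above under a hypothetical key (`:demo_launch_rng`) and a zero-argument `make_seed`, so it runs without CUDA.jl:

```julia
using Random

# Same pattern as the PR, but under a hypothetical demo key.
launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :demo_launch_rng)::Random.Xoshiro
make_seed() = rand(launch_rng(), UInt32)

# Direction 1: drawing a launch seed leaves the default stream alone.
Random.seed!(42); a = rand()
Random.seed!(42); make_seed(); b = rand()
stream_untouched = (a == b)

# Direction 2: reseeding the default RNG leaves the launch RNG alone.
before = copy(launch_rng())          # snapshot of the private RNG state
Random.seed!(42)
launch_untouched = (launch_rng() == before)

println((stream_untouched, launch_untouched))
```

Comparing a copied `Xoshiro` against the live one checks the state directly, which avoids relying on two random seeds happening to differ.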

Task-local storage was chosen over a module-global RNG to stay thread-safe without locking on the launch path, mirroring how Julia's own default RNG is per-task. The get!(factory, dict, key) form matches the existing devices() helpers in lib/cublas / lib/cusolver, and the ::Random.Xoshiro assertion keeps the launch path type-stable (0 allocations on the fast path). Device-side seeding (make_seed(::DeviceKernel) and device/random.jl's Philox2x32) is untouched.
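
To illustrate the per-task behavior, here is a small host-only sketch (again with a hypothetical `:demo_launch_rng` key): within one task the same `Xoshiro` object is reused, while a freshly spawned task lazily creates its own, so no lock is needed.

```julia
using Random

# Hypothetical demo key; the pattern matches the PR's launch_rng().
launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :demo_launch_rng)::Random.Xoshiro

# Repeated calls in the same task return the identical object.
same_within_task = launch_rng() === launch_rng()

# A new task has its own task-local storage, hence its own RNG.
other_rng = fetch(Threads.@spawn launch_rng())
distinct_across_tasks = other_rng !== launch_rng()

println((same_within_task, distinct_across_tasks))
```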

An alternative I considered was a global atomic counter: Philox keys only need to be distinct, not random, so sequential keys would still give independent streams. That would make launch seeds deterministic across runs, which is a bigger semantic change than this PR intends. Happy to switch if that behavior is preferred.
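
For comparison, the counter alternative might look like the sketch below. The names `LAUNCH_COUNTER` and `next_seed` are hypothetical and this is not what was merged; it only shows that an atomic increment hands out distinct keys under concurrency.

```julia
# Hypothetical alternative, not part of the merged change.
const LAUNCH_COUNTER = Threads.Atomic{UInt32}(0)

# atomic_add! returns the value held before the increment, so every
# caller observes a different key (until the UInt32 wraps around).
next_seed() = Threads.atomic_add!(LAUNCH_COUNTER, UInt32(1))

tasks = map(1:1000) do _
    Threads.@spawn next_seed()
end
seeds = fetch.(tasks)

println(allunique(seeds))
```

The keys are the same on every run of the program, which is the determinism trade-off mentioned above.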

Testing

Added a regression test in test/core/execution.jl asserting the host RNG stream is identical with and without an interleaved launch, and that consecutive launches still receive distinct seeds. Verified on a Quadro RTX 6000 (sm_75, Julia 1.11.9): new testset passes (6/6) and the existing device-side RNG tests are unaffected (11/11).
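
The shape of the regression test is roughly as follows. This is a host-only sketch, not the test added in test/core/execution.jl: `fake_launch` is a hypothetical stand-in for an actual `@cuda` launch, and `:demo_launch_rng` is a made-up key.

```julia
using Random, Test

launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :demo_launch_rng)::Random.Xoshiro

# Hypothetical stand-in for `@cuda kernel()`: only the seed draw is modeled.
fake_launch() = rand(launch_rng(), UInt32)

@testset "launch seeds use a private RNG (sketch)" begin
    # Reference stream without any launches.
    Random.seed!(42)
    expected = [rand() for _ in 1:3]

    # Same stream with launches interleaved between draws.
    Random.seed!(42)
    x1 = rand(); fake_launch()
    x2 = rand(); fake_launch()
    x3 = rand()
    @test [x1, x2, x3] == expected

    # Consecutive launches get different seeds
    # (a collision has probability 2^-32).
    @test fake_launch() != fake_launch()
end
```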

Possible follow-up (out of scope here)

If deterministic launch seeds are ever wanted (e.g. for debugging), a CUDA.seed_launch!(seed) that reseeds the task-local launch RNG would slot in cleanly on top of this — happy to do that as a separate PR if there's interest.
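
For concreteness, such a helper could be as small as the sketch below. `seed_launch!` does not exist in CUDA.jl; it is the hypothetical function named above, written here against a host-only copy of `launch_rng()` with a made-up key.

```julia
using Random

launch_rng() = get!(Random.Xoshiro, task_local_storage(),
                    :demo_launch_rng)::Random.Xoshiro
make_seed() = rand(launch_rng(), UInt32)

# Hypothetical helper: reseed only the private launch RNG of this task.
seed_launch!(seed::Integer) = (Random.seed!(launch_rng(), seed); nothing)

seed_launch!(1234); first_run  = [make_seed() for _ in 1:4]
seed_launch!(1234); second_run = [make_seed() for _ in 1:4]

println(first_run == second_run)
```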

maleadt (Member) commented Jun 5, 2026

> If deterministic launch seeds are ever wanted (e.g. for debugging), a CUDA.seed_launch!(seed) that reseeds the task-local launch RNG would slot in cleanly on top of this; happy to do that as a separate PR if there's interest.

Not needed. The user can always put a seed! call in the kernel.

Comment thread CUDACore/src/compiler/execution.jl Outdated
Comment on lines +476 to +480
# Task-local RNG used solely to seed kernel launches. Drawing launch seeds
# from a private RNG (rather than the task-global default) ensures that a kernel
# launch never perturbs the user-visible `rand()` stream. Lazily created per
# task; `Xoshiro()` seeds itself from system entropy without touching the
# default RNG. (JuliaGPU/CUDA.jl#2417)

Please de-LLM some of these comments: no need to refer to the previous state, #2417 was a TODO so not worth pointing to, etc.

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.32%. Comparing base (aa47d7a) to head (88072e4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3161      +/-   ##
==========================================
- Coverage   16.33%   16.32%   -0.02%     
==========================================
  Files         124      124              
  Lines        9875     9875              
==========================================
- Hits         1613     1612       -1     
- Misses       8262     8263       +1     

github-actions Bot (Contributor) left a comment

CUDA.jl Benchmarks

Benchmark suite Current: 88072e4 Previous: aa47d7a Ratio
array/accumulate/Float32/1d 99401 ns 98976 ns 1.00
array/accumulate/Float32/dims=1 75801 ns 75470 ns 1.00
array/accumulate/Float32/dims=1L 1596618 ns 1595788 ns 1.00
array/accumulate/Float32/dims=2 140800 ns 140419 ns 1.00
array/accumulate/Float32/dims=2L 653847 ns 653444 ns 1.00
array/accumulate/Int64/1d 118385 ns 118155 ns 1.00
array/accumulate/Int64/dims=1 79035 ns 78907 ns 1.00
array/accumulate/Int64/dims=1L 1708144 ns 1709506 ns 1.00
array/accumulate/Int64/dims=2 153801 ns 153939 ns 1.00
array/accumulate/Int64/dims=2L 959403 ns 959330 ns 1.00
array/broadcast 18271 ns 18270 ns 1.00
array/construct 1222.3 ns 1198.4 ns 1.02
array/copy 16408 ns 16676 ns 0.98
array/copyto!/cpu_to_gpu 212320 ns 211135 ns 1.01
array/copyto!/gpu_to_cpu 279697 ns 278832 ns 1.00
array/copyto!/gpu_to_gpu 10291 ns 10531 ns 0.98
array/iteration/findall/bool 133390 ns 131993 ns 1.01
array/iteration/findall/int 147460 ns 146745 ns 1.00
array/iteration/findfirst/bool 111639 ns 111631 ns 1.00
array/iteration/findfirst/int 112010 ns 111858 ns 1.00
array/iteration/findmin/1d 65743 ns 66902 ns 0.98
array/iteration/findmin/2d 100431 ns 100550 ns 1.00
array/iteration/logical 191970 ns 189124 ns 1.02
array/iteration/scalar 64769 ns 66015 ns 0.98
array/permutedims/2d 49410 ns 49598 ns 1.00
array/permutedims/3d 50855 ns 50240 ns 1.01
array/permutedims/4d 50619 ns 50411 ns 1.00
array/random/rand/Float32 11524 ns 11982 ns 0.96
array/random/rand/Int64 24234 ns 23515 ns 1.03
array/random/rand!/Float32 7935 ns 8122 ns 0.98
array/random/rand!/Int64 20840 ns 20501 ns 1.02
array/random/randn/Float32 34692 ns 34458 ns 1.01
array/random/randn!/Float32 23818 ns 24130 ns 0.99
array/reductions/mapreduce/Float32/1d 33469 ns 33763 ns 0.99
array/reductions/mapreduce/Float32/dims=1 38081 ns 38228 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 49795 ns 50249 ns 0.99
array/reductions/mapreduce/Float32/dims=2 55488 ns 55439 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 67122 ns 67163 ns 1.00
array/reductions/mapreduce/Int64/1d 39352 ns 40187 ns 0.98
array/reductions/mapreduce/Int64/dims=1 41051 ns 40738 ns 1.01
array/reductions/mapreduce/Int64/dims=1L 86289 ns 86458 ns 1.00
array/reductions/mapreduce/Int64/dims=2 57638 ns 57724 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 82659 ns 82751 ns 1.00
array/reductions/reduce/Float32/1d 33347 ns 33392 ns 1.00
array/reductions/reduce/Float32/dims=1 38115 ns 38091 ns 1.00
array/reductions/reduce/Float32/dims=1L 49855 ns 49963 ns 1.00
array/reductions/reduce/Float32/dims=2 55304 ns 55491 ns 1.00
array/reductions/reduce/Float32/dims=2L 68959 ns 68898 ns 1.00
array/reductions/reduce/Int64/1d 38812 ns 40022 ns 0.97
array/reductions/reduce/Int64/dims=1 40602 ns 40672 ns 1.00
array/reductions/reduce/Int64/dims=1L 86182 ns 86301 ns 1.00
array/reductions/reduce/Int64/dims=2 57354 ns 57311 ns 1.00
array/reductions/reduce/Int64/dims=2L 82719 ns 82411 ns 1.00
array/reverse/1d 16735 ns 16807 ns 1.00
array/reverse/1dL 67674 ns 67676 ns 1.00
array/reverse/1dL_inplace 65157 ns 65187 ns 1.00
array/reverse/1d_inplace 8217 ns 9321.666666666666 ns 0.88
array/reverse/2d 19752 ns 19959 ns 0.99
array/reverse/2dL 71559 ns 71879 ns 1.00
array/reverse/2dL_inplace 64905 ns 65104 ns 1.00
array/reverse/2d_inplace 9515 ns 11067 ns 0.86
array/sorting/1d 2650235 ns 2655417 ns 1.00
array/sorting/2d 1040002 ns 1038734 ns 1.00
array/sorting/by 3192735 ns 3192232 ns 1.00
cuda/synchronization/context/auto 1131.6 ns 1131.5 ns 1.00
cuda/synchronization/context/blocking 904.6666666666666 ns 952.2173913043479 ns 0.95
cuda/synchronization/context/nonblocking 5874 ns 6097.8 ns 0.96
cuda/synchronization/stream/auto 992.25 ns 1004.5 ns 0.99
cuda/synchronization/stream/blocking 811.2 ns 825.6363636363636 ns 0.98
cuda/synchronization/stream/nonblocking 5921 ns 6045.333333333333 ns 0.98
integration/byval/reference 143128 ns 143141 ns 1.00
integration/byval/slices=1 145262 ns 145110 ns 1.00
integration/byval/slices=2 283680 ns 283495 ns 1.00
integration/byval/slices=3 422143 ns 422045 ns 1.00
integration/cudadevrt 101447 ns 101557 ns 1.00
integration/volumerhs 9078485 ns 9077766 ns 1.00
kernel/indexing 12466 ns 12534 ns 0.99
kernel/indexing_checked 13466 ns 13291 ns 1.01
kernel/launch 2040.2222222222222 ns 2072.5555555555557 ns 0.98
kernel/occupancy 699.2517006802722 ns 716.9097744360902 ns 0.98
kernel/rand 13666 ns 13723 ns 1.00
latency/import 3845165278 ns 3841987489 ns 1.00
latency/precompile 4621357772 ns 4621684240 ns 1.00
latency/ttfp 4491670603 ns 4482964065 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

JohnCobbler force-pushed the launch-seed-rng branch 2 times, most recently from 899cf3b to 88072e4, on June 5, 2026 13:26
Kernel launches used to draw their seed from the default RNG, perturbing
the user-visible rand() stream. Use a lazily-created task-local Xoshiro
instead. Includes a regression test.
maleadt merged commit 112549e into JuliaGPU:main on Jun 5, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

CUDA kernel launches should use their own RNG